QTM 447 Lecture 7: Nonlinearities and Expressive Learning Methods

Kevin McAlister

February 4, 2025

Machine Learning

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Recall that the goal of supervised machine learning is to learn some function that generally does a good job of mapping the input features to an outcome.

Generically, we assume that:

\[y = f(\mathbf x) + \epsilon\]

  • There exists a true function out there

Machine Learning

We would like to find an approximation to this function, \(\hat{f}(\mathbf x)\), that minimizes the generalization error:

\[\hat{f}(\mathbf x) = \underset{f^*}{\text{argmin }} E_\mathcal T[\mathcal L(y - f^*(\mathbf x))]\]

Unfortunately, we don’t get to see this quantity and need to approximate it.

Machine Learning

Most commonly, we rely on the empirical risk minimization framework. Given training data:

\[E_\mathcal T[\mathcal L(y - f^*(\mathbf x))] \approx \expn \mathcal L(y_i - f^*(\mathbf x_i)) + \frac{\lambda}{N} C(f^*(\mathbf x))\]

where \(C()\) is a measure of the complexity of the approximating function.

  • Per the VC dimension proof, we know that the gap between the training error and the generalization error goes to 0 as \(N \to \infty\)

Machine Learning

We also know that generalization error can be decomposed into three components:

  • Bias: How far is the proposed function from the truth on average?

  • Irreducible Error: Given our feature set, how much variation in the outcome can we not explain?

  • Complexity: How complicated is the function and how much capacity does it have to overfit to the training data in a way that misrepresents \(\mathcal T\)?

Only the complexity component is mitigated by the size of the training data!

  • As \(N\) gets large, we are guaranteed that i.i.d. samples are more and more representative of \(\mathcal T\) and overfitting becomes less of a problem

Machine Learning

Let’s think about our ability to uncover the true function as \(N\) gets very large:

  • The complexity penalty goes away with \(N\)

  • The irreducible error is there to stay

What we’re left with is bias

Machine Learning

Recall that bias in the machine learning context is how far away we are over the entire function from the truth on average

  • In other words, how closely can we approximate the true functional form with our chosen learning method?

Machine Learning

Very complicated classification problem in two dimensions.

  • Very large training set of 1 million observations

Let’s try to fit an ERM minimizing logistic regression.

\[\mathcal L(\bb) = -\expn \left[ y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i) \right]\]

\[\theta_i = \sigma(z_i)\]

\[z_i = \mathbf x_i^T \boldsymbol \beta\]
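A quick numpy sketch of this setup (the circular boundary, sample size, and step size below are illustrative stand-ins, not the exact problem from the slides): even a fully optimized linear logistic regression gets stuck near majority-class accuracy when the true boundary is curvy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a "curvy" (circular) decision boundary in 2-D.
N = 2000
X = rng.uniform(-1, 1, size=(N, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)  # true boundary is a circle

# ERM for logistic regression via plain gradient descent on the log loss.
Xb = np.column_stack([np.ones(N), X])          # add an intercept column
beta = np.zeros(3)
for _ in range(2000):
    theta = 1.0 / (1.0 + np.exp(-Xb @ beta))   # theta_i = sigma(z_i)
    beta -= 0.5 * Xb.T @ (theta - y) / N       # gradient of the average NLL

acc = np.mean((Xb @ beta > 0) == (y == 1))
# A linear boundary can do little better than predicting the majority class
# here, so accuracy stays far below what the nonlinear truth allows.
```

Optimization is not the problem in this sketch: the loss is convex and gradient descent converges. The bias comes entirely from the linear form of \(z_i\).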

Machine Learning

Unfortunately, there is no way for this linear classifier to uncover a curvy decision boundary!

  • We’ll always have bias

Therefore, our generalization error can never be minimal

  • The minimum is \(0 + \text{Irreducible Error}\), and the leftover bias keeps us above it

The problem is that a linear decision boundary isn’t expressive enough to capture the complexity of the problem!

Machine Learning

How can we fix this?

One initial approach is to try to figure out a formula for the decision boundary and use that to alter the learning approach to appropriately capture the true function

  • For certain problems, a quadratic decision boundary might be enough and we can use something like QDA or Naive Bayes

But the goal of machine learning is to learn from the data algorithmically, not to hand-derive the functional form

  • Additionally, there isn't a closed-form function for this particular decision boundary!

Universal Approximators

Instead, let’s restrict ourselves to learning methods that are universal approximators

Universal Approximator:

A learning method that can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error.

Deciphering the nerd speak:

A universal approximator can uncover any mapping of \(\mathbf x\) to \(y\) that is continuous on a bounded subset of \(\mathbb R^P\)

Even fewer syllables:

It can learn any reasonable function

Universal Approximators

IMPORTANT!

Just because it can, doesn’t mean that it will.

Reason 1:

The optimization method used to try to learn the function can’t actually find the function

  • Non-convex loss spaces and local minima!

A practical reason that a universal approximator won’t find the truth.

Universal Approximators

IMPORTANT!

Just because it can, doesn’t mean that it will.

Reason 2:

The training algorithm may choose the wrong function due to overfitting

  • All we get is the training data, so a universal approximator may not uncover the minimum generalization error function

  • Confusing irreducible error for bias.

Universal Approximators

Regardless, universal approximators are desirable in machine learning

  • For the kinds of data that we expect to see frequently in more complex problems (think image analysis), universal approximators are capable of uncovering any wiggly functional form

  • Doesn’t require inducing prior knowledge of the problem in order to achieve generalizable functions

Universal Approximators

How can we know that something is a universal approximator?

True in most cases:

The learning method could theoretically achieve 0 training error for any training data set

  • We know that this isn’t really good for generalization

  • However, if \(N = \infty\), then we would like to be able to learn the true function from the infinite i.i.d. samples from \(\mathcal T\)!

  • i.e. no bias for the true functional form

Universal Approximators

In your previous classes, you’ve already discussed some examples of universal approximators

Which learning methods that you know of are universal approximators?

Universal Approximators

There’s a lot!

  • KNN

  • Explicit Polynomial Feature Transformations

  • Implicit Polynomial Expansions via Infinite Basis Kernels

  • Regression/Classification Trees

And neural networks with at least one hidden layer

  • There are a lot more methods that do this, just not used a lot for a variety of reasons

Universal Approximators

Let’s look over some of these and think about:

  • How do they work as universal approximators?

  • How well do we expect that they will generalize with \(N < \infty\)?

  • Computational time

KNN

KNN can represent any function as \(N\) gets large!

  • It’s a simple method that has this awesome property

KNN

Neat.

Why don’t we just use KNN all the time?

  • Think about high dimensional problems

KNN

Curse of Dimensionality:

For uniform coverage of the feature space, the number of training points (i.e., parameters) needed to fully represent it grows exponentially in the dimensionality.

KNN

The number of points (parameters) needed to effectively represent the space for a 1NN classifier is \(10^P\): roughly 10 grid points along each of the \(P\) dimensions.

This is exponential blow-up in the necessary size of the parameter space.

This is the curse of dimensionality!

  • If the number of parameters needed to effectively model the space increases exponentially with the dimensionality of the feature space, we’re going to have problems for complex high dimensional problems.

KNN

The curse of dimensionality is a problem for learning:

  • When a learning method suffers from the curse of dimensionality, the number of implied parameters explodes relative to the size of training data

  • More parameters = higher complexity penalty and worse generalization performance. We know that the complexity grows linearly in the number of estimated parameters and the generalization error is roughly:

\[E_\mathcal T[\mathcal L(y - \hat{y})] = \expn \mathcal L(y_i - \hat{y}_i) + \mathcal O \left(\frac{P}{N} \right)\]

  • The training size offset is still inverse linear. So, we need an exponentially increasing number of training examples to achieve an equivalent offset!

KNN

Because of this problem, KNN is not expected to generalize very well when the dimensionality of the feature space is large!

  • We need an insane number of training examples to have good generalization performance.

KNN also does not scale well in \(N\) when making a prediction.

Prediction Time: \(\mathcal O(K \log N)\)

  • As the training set gets larger, the time needed to make one prediction grows.

  • This is not great.
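For concreteness, brute-force KNN classification is only a few lines of numpy (the circular-boundary data here is a made-up stand-in). Note that every single prediction scans all \(N\) training points; tree structures are what get this down toward the \(\mathcal O(K \log N)\) quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_predict(X_train, y_train, x0, k=5):
    """Brute-force KNN: one prediction costs O(N P) distance computations."""
    d = np.linalg.norm(X_train - x0, axis=1)       # distance to every training point
    nearest = np.argsort(d)[:k]                    # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()  # majority vote among neighbors

X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # circular true boundary
pred_in = knn_predict(X, y, np.array([0.0, 0.0]))      # deep inside the circle
pred_out = knn_predict(X, y, np.array([0.9, 0.9]))     # well outside it
```

The local vote handles the curvy boundary with no model of it at all, which is exactly why KNN is a universal approximator as \(N\) grows.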

Linear Models?

An aside:

Does the most basic linear model suffer from the curse of dimensionality?

\[\hat{y} = \mathbf x^T \boldsymbol \beta\]

Linear Models?

Nope.

As \(P\) increases, the number of model parameters increases linearly

The cost:

Lack of expressiveness and high bias for complicated problems.

Wiggly Linear Models?

It turns out, though, that the linear model can be altered to be a universal approximator through clever choices of basis expansion.

One such choice is global polynomials

\[\phi(\mathbf x) = [1,\, x_1,\, x_1^2,\, x_2,\, x_2^2,\, x_1 x_2,\, x_1^2 x_2,\, x_1 x_2^2,\, \ldots]\]

  • All possible powers and interactions among variables!

Wiggly Linear Models?

Neat.

Why don’t we do this all the time?

A full polynomial expansion of degree \(d\) with \(P\) features has

\[\binom{P + d}{d}\]

features.

Using Stirling’s approximation, we can argue that this evaluates to roughly:

\[\frac{P^d}{d!}\]
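The count is easy to check with Python's `math.comb` (the values of \(P\) and \(d\) below are arbitrary):

```python
from math import comb, factorial

# Exact feature count for a full degree-d polynomial expansion of P inputs,
# versus the Stirling-style approximation P^d / d!.
d = 3
counts = {P: comb(P + d, d) for P in (10, 100, 1000)}
approx = {P: P ** d / factorial(d) for P in (10, 100, 1000)}
# Just 100 inputs at degree 3 already yield counts[100] = 176,851 features,
# and the approximation tracks the exact count closely for large P.
```

So a modest feature set at a modest degree already produces hundreds of thousands of expanded features.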

Wiggly Linear Models?

Now, the number of features blows up exponentially in the degree of the polynomial!

And as \(P\) increases with \(d\) fixed, the count still explodes like \(P^d\): a high-degree polynomial in the dimensionality of the feature space.

Wiggly Linear Models?

This is neither going to generalize well nor be particularly computationally efficient!

Generalization Gap:

\[\mathcal O \left( \frac{P^d}{Nd!} \right)\]

Training time:

\[\mathcal O \left(\frac{NP^{2d}}{d!} \right)\]

Prediction Time:

\[\mathcal O \left(\frac{P^d}{d!} \right)\]

Kernel Methods

The problem with explicit polynomial methods is the explosion in the feature set as \(P\) gets large.

Let’s look at a prediction made using this method for a continuous outcome using the OLS estimator:

\[y_0 = \mathbf x_0^T \hat{\boldsymbol \beta} = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y\]

Using a neat matrix identity called the Woodbury Identity, we can show that this is equivalent to:

\[y_0 = \mathbf x_0^T \mathbf X^T (\mathbf X \mathbf X^T)^{-1} \mathbf y\]

Letting \(\boldsymbol \alpha = (\mathbf X \mathbf X^T)^{-1} \mathbf y\), we’re left with:

\[y_0 = \mathbf x_0^T \mathbf X^T \boldsymbol \alpha\]

Kernel Methods

Putting this in sum notation:

\[y_0 = \sum \limits_{i = 1}^N \mathbf x_0^T \mathbf x_i \alpha_i\]

Under a linear model, the prediction at any point is equivalent to the sum of the inner product of the feature vector with each training feature vector multiplied by a weight.

  • In other words, for arbitrary \(\phi(\mathbf x)\), we can compute the predictions as the inner product of the expanded feature vectors!

Kernel Methods

This is important because of Mercer’s Theorem

Let \(\mathbf K\) be a \(N \times N\) positive definite kernel matrix:

\[\mathbf K = \begin{bmatrix} \mathcal K(\mathbf x_1, \mathbf x_1) & \ldots & \mathcal K(\mathbf x_1, \mathbf x_N) \\ \vdots & \ddots & \vdots \\ \mathcal K(\mathbf x_N, \mathbf x_1) & \ldots & \mathcal K(\mathbf x_N, \mathbf x_N) \\ \end{bmatrix} \]

where \(\mathcal K()\) is a symmetric function such that, for any constants \(c_1, \ldots, c_N\):

\[\sum \limits_{i = 1}^N \sum \limits_{j = 1}^N \mathcal K(\mathbf x_i , \mathbf x_j) c_i c_j \ge 0\]

Kernel Methods

Mercer’s theorem states that any viable \(\mathcal K(\mathbf x_i , \mathbf x_j)\) can be expressed as:

\[k_{i,j} = \phi(\mathbf x_i)^T \phi(\mathbf x_j)\]

where \(\phi()\) is determined by the eigenfunctions of the kernel.

What does this mean?

  • For certain basis expansions for certain methods (like linear regression), we can express a large basis expansion over \(P\) features, a \(N \times P'\) matrix, as a smaller \(N \times N\) matrix

  • We can encode some really complicated basis expansions of size \(P'\) in a \(N\) vector.

Kernel Methods

This is useful because we can substitute \(\phi(\mathbf X)\) with \(\mathbf K\)!

  • Achieve really curvy results

One common example is the polynomial kernel:

\[k_{ij} = (\mathbf x_i^T \mathbf x_j + c)^d\]

which corresponds to a feature map for each vector that is a scaled equivalent of the full polynomial expansion of degree \(d\)!

  • Note that this expansion is of finite length
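We can verify this equivalence for \(d = 2\): the explicit feature map below (the ordered-pair construction, which avoids any \(\sqrt{2}\) bookkeeping on cross terms) reproduces the kernel value exactly.

```python
import numpy as np

rng = np.random.default_rng(4)

def phi(x, c=1.0):
    """Explicit feature map whose inner product equals (x.z + c)^2."""
    pairs = np.outer(x, x).ravel()            # all x_i * x_j ordered-pair terms
    return np.concatenate([pairs, np.sqrt(2 * c) * x, [c]])

x, z = rng.normal(size=5), rng.normal(size=5)
k_implicit = (x @ z + 1.0) ** 2               # kernel: one inner product in R^5
k_explicit = phi(x) @ phi(z)                  # same number via 31 expanded features
```

The kernel computes in \(\mathcal O(P)\) what the explicit map needs \(\mathcal O(P^2)\) features to represent, and the gap only widens as \(d\) grows.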

Kernel Methods

The most common kernel is the RBF kernel:

\[k_{ij} = \exp \left[- \frac{\| \mathbf x_i - \mathbf x_j\|^2}{2 \gamma^2} \right]\]

where \(\gamma\) is a scale constant.

The RBF kernel has an infinite length feature map that is essentially like a polynomial expansion of infinite degree!

  • With small \(\gamma\), we have a guaranteed universal approximator
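A minimal kernel ridge regression sketch with the RBF kernel (the \(\gamma\) and ridge penalty below are illustrative values, not tuned): it fits a very wiggly one-dimensional function using nothing but the \(N \times N\) kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(5)

x = np.linspace(0, 1, 100)
y = np.sin(8 * np.pi * x)                     # a wiggly target function

gamma, lam = 0.05, 1e-6                       # scale constant and ridge penalty
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * gamma ** 2))  # N x N kernel
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)            # N coefficients
fit = K @ alpha

mse = np.mean((fit - y) ** 2)                 # near zero: extremely expressive
```

The linear model never sees the curvy function directly; all of the nonlinearity lives in the kernel matrix, and we estimate one coefficient per training point.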

Kernel Methods

Prior to the computational boom in the early 2000s, RBF kernel regressions/SVMs were the dominant approach to flexible machine learning!

  • Take any feature matrix and transform it into a \(N \times N\) kernel matrix using the RBF kernel

  • Train a linear model on the altered feature space

  • Choose \(\gamma\) that yields the best generalization performance

Kernel Methods

This works pretty well because it doesn’t suffer from the curse of dimensionality

  • \(\mathbf K\) is always \(N \times N\)

  • Estimate \(N\) coefficients

So, the number of parameters increases linearly in the training set size!

  • No exponential explosion

Kernel Methods

Regression with the RBF kernel is a universal approximator!

Why don’t we use this all the time?

Two-fold problem:

  • Generalization

  • Computational Time

Kernel Methods

With \(N < \infty\), the RBF kernel can yield arbitrarily bad generalization performance

  • It’s too expressive for its own good

  • Tune \(\gamma\) and judge using \(K\)-fold CV

Computing and inverting a \(N \times N\) matrix can be really tough on your machine (inversion alone costs \(\mathcal O(N^3)\))

  • If \(N\) is large, it just isn’t going to happen

Kernel Methods

RBF kernels are useful for modern machine learning methods, but do not scale particularly well!

  • If the bottleneck is in terms of training size instead of feature dimensions, we’ll quickly run into problems.

We’ll see the RBF kernel combined with more scalable modern methods soon

Local Approximations

Another regression style approach that yields universal approximations is the set of local approximators

  • In a neighborhood of the feature space, create a local flat/linear/quadratic/cubic approximation to the function

For low-dimensional feature spaces, the most common method of doing this is regression splines

Local Approximations

For a single feature, we can express this approximation as follows:

  • Divide the feature space into \(K\) distinct regions - call the points of division (the knots) \(\boldsymbol \xi\)

  • Within each region, use the training points to fit a local low-order polynomial approximation

  • With clever design, we can ensure continuity of our prediction function
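A piecewise-linear version of the three steps above, sketched in numpy (knot locations and noise level chosen by hand): the truncated-power basis terms \((x - \xi_k)_+\) are what guarantee continuity at the knots.

```python
import numpy as np

rng = np.random.default_rng(6)

x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

knots = np.array([0.2, 0.4, 0.6, 0.8])        # the division points (knots) xi
# Basis [1, x, (x - xi_1)_+, ..., (x - xi_K)_+]: linear within each region,
# continuous across the knots by construction.
B = np.column_stack([np.ones_like(x), x] +
                    [np.maximum(x - k, 0) for k in knots])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # ordinary least squares on the basis
spline_mse = np.mean((B @ coef - y) ** 2)

# Compare with a plain linear fit (no knots).
A = np.column_stack([np.ones_like(x), x])
lin_coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_mse = np.mean((A @ lin_coef - y) ** 2)
```

With just four knots, the local fit tracks the sine wave that a global line cannot, at the cost of only four extra parameters per feature; this per-feature cost is exactly what becomes unaffordable as \(P\) grows.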

Local Approximations

These methods are great for low dimensional feature spaces, but don’t work well at all for \(P > 4\)

  • It’s just too many parameters for the number of observations!

The exception is the step approximation

  • Extend the idea of a step to a rectangle in the feature space

  • Just make the prediction the average/majority class in that region

Trees

This is the basis of tree based algorithms

  • Start by making a single binary split on one feature that minimizes some loss criterion

  • Make a new split

  • Continue

  • Stop when we’ve reached full depth and each training point is in its own rectangle!
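The core operation in the steps above, one greedy binary split, can be sketched in a few lines (squared-error criterion, single feature, toy step-function data):

```python
import numpy as np

rng = np.random.default_rng(7)

def best_split(x, y):
    """One greedy binary split on a single feature: choose the threshold
    minimizing the total squared error of the two region means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_t = np.inf, None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_t = sse, (xs[i - 1] + xs[i]) / 2
    return best_t

# Step function with a jump at x = 0.5: the first split should land near it.
x = rng.uniform(size=500)
y = (x > 0.5).astype(float) + 0.05 * rng.normal(size=500)
threshold = best_split(x, y)
```

A full tree just applies this search recursively within each resulting region (and, with multiple features, over each candidate feature) until every training point sits in its own rectangle.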

Trees

Referred to as trees since the iterative splits can be organized in a tree-like structure.

Tree based methods are universal approximators if they are grown out to full depth

Trees

Single decision trees will approximate any function with enough data!

  • Problems with single decision trees?

Trees

Single decision trees generalize really poorly

Curse of dimensionality:

The more features we have, the more training points we need to draw enough rectangles to effectively represent complicated spaces!

Overfitting everywhere!

  • Even though decision trees only have \(N\) parameters, the number needed to do a good job explodes exponentially as the dimensionality of the feature space increases.

Trees

Random Forests with Bagging present a really clever solution to this problem that doesn’t ruin the universal approximator properties of this algorithm

  • What is the key difference between a single decision tree and a random forest?

Trees

By limiting the number of features in each tree to a small number, each decision tree can grow to full depth and generalize well over that feature set!

Then bagging allows us to average over the trees in a meaningful way to get a good approximation to the true function without overfitting.

  • A really clever solution to the curse of dimensionality
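A stylized bagging demo (using 1-NN regression as a stand-in for a full-depth tree, since a full-depth tree on one feature behaves much like 1-NN): averaging bootstrap fits smooths out the noise that a single overfit learner memorizes.

```python
import numpy as np

rng = np.random.default_rng(8)

def one_nn(x_tr, y_tr, x_te):
    """1-NN regression: predict the outcome of the closest training point."""
    idx = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
    return y_tr[idx]

# Noisy 1-D regression problem.
x_tr = rng.uniform(size=300)
y_tr = np.sin(2 * np.pi * x_tr) + 0.5 * rng.normal(size=300)
x_te = np.linspace(0.05, 0.95, 200)
truth = np.sin(2 * np.pi * x_te)

single = one_nn(x_tr, y_tr, x_te)             # one overfit learner: keeps all noise

# Bagging: average B learners, each trained on a bootstrap resample.
B = 50
bagged = np.zeros_like(x_te)
for _ in range(B):
    b = rng.integers(0, 300, 300)             # bootstrap resample indices
    bagged += one_nn(x_tr[b], y_tr[b], x_te)
bagged /= B

mse_single = np.mean((single - truth) ** 2)
mse_bagged = np.mean((bagged - truth) ** 2)   # averaging cancels much of the noise
```

Each individual learner is still badly overfit; the variance reduction comes entirely from averaging across the bootstrap fits, which is the bagging half of a random forest.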

Trees

Random forests are a really powerful algorithm for building predictive models for tabular data

  • Training data can be expressed as an \(N \times P\) training matrix with a single outcome

There are attempts to use random forests/boosted trees for complex outputs like text and images

  • These approaches are just dominated by state of the art deep learning approaches

Trees

One of the reasons that we don’t use trees all the time for really big data is that they scale pretty poorly in \(N\)

To grow a single decision tree to full depth:

\[\mathcal O(P N \log N)\]

  • It’s not that bad for a single decision tree

Trees

Random forests can be a problem, though

Let \(B\) be the number of trees with feature set size \(P'\) that we grow to full depth and average using the bagging approach

\[\mathcal O(B P' N \log N)\]

  • We can distribute each subtree over a cluster of CPUs to reduce time a little more

  • With enough computational power, not that bad

  • Random Forests also have new GPU implementations that speed up processing time

Trees

Random Forests (and boosted trees) work very well in certain settings

  • Tabular data, mostly

They are frequently shown to outperform neural network architectures in many cases!

Where they struggle, though, is with really complex data types

  • Images

  • Speech

  • Text

Places where deep learning has really produced huge improvements in the quality of learning models!

Next Time

This is an overview of modern methods that work well in a lot of settings

  • But not all!

We have one more major universal approximator architecture to consider in neural networks

  • Gets around the curse of dimensionality (mostly) by engineering low dimensional representations of high dimensional feature spaces